1 SAS Foundation Interfaces for Hadoop

Tool | Purpose | Product
FILENAME statement | Allows the DATA step to read and write HDFS data files. | Base SAS
PROC HADOOP | Copies or moves files between SAS and Hadoop; executes MapReduce and Pig code; executes Hadoop file system commands to manage files and directories. | Base SAS
SQL Pass-Through | Submits HiveQL queries and other HiveQL statements from SAS directly to Hive for processing; query results are returned to SAS. | SAS/ACCESS Interface to Hadoop
LIBNAME statement for Hadoop | Accesses Hive tables as SAS data sets using the SAS programming language; the SAS/ACCESS engine translates the SAS language into HiveQL and attempts to push the processing to Hive before returning results to SAS. | SAS/ACCESS Interface to Hadoop

2 SAS In-Memory Analytics Interfaces for Hadoop

2.1 A Hadoop Cluster to Run SAS

2.2 Base SAS: FILENAME for Hadoop

2.3 Base SAS: The Same Process for PROC HADOOP

2.4 SAS/ACCESS: SQL Pass-Through and LIBNAME

2.5 In-Memory Analytics

2.6 Computing Environment

3 Managing Files and Executing Hadoop Commands, Map-Reduce, and Pig

4 Part 1: Introduction to Base SAS Methods for Hadoop

4.1 Base SAS Tools for Hadoop

  • The Hadoop FILENAME statement enables you to do the following:
    • upload local data to Hadoop using the DATA step
    • read data from Hadoop using the DATA step
  • PROC HADOOP enables you to do the following:
    • submit Hadoop file system (HDFS) commands
    • submit MapReduce programs
    • submit Pig scripts

4.2 The Hadoop Config.xml File

  • The FILENAME statement for Hadoop and PROC HADOOP require an option that specifies a Hadoop configuration file (config.xml).
    • The configuration file defines how to connect to Hadoop.
    • The file must be accessible to the SAS client application.
    • A SAS administrator commonly manages this configuration for the SAS users.
    • This file is often referred to as the Hadoop core-site.xml file.
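A minimal configuration file might look like the following sketch. The property names follow standard Hadoop conventions, but the host name and port shown are illustrative assumptions; the actual file should come from your Hadoop administrator.

```xml
<?xml version="1.0"?>
<!-- Illustrative sketch only: the host name and port are assumptions,
     not defaults for any particular site. -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
```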

4.3 Hadoop JAR Files

  • A collection of Hadoop JAR files is also required on the SAS client machine.
    • An environment variable, SAS_HADOOP_JAR_PATH, on the SAS client machine defines the location of the Hadoop JAR files.
    • The Hadoop JAR files must be compatible with the specific Hadoop implementation and can be copied from the Hadoop server.
    • A Hadoop system administrator commonly manages the configuration of the Hadoop JAR files for the SAS users.
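For example, on a UNIX SAS client the environment variable might be set as follows. The directory shown is a placeholder assumption, not a standard location:

```shell
# Hypothetical directory -- copy the JAR files from the Hadoop server
# to a location of your choosing and point the variable at it.
export SAS_HADOOP_JAR_PATH=/opt/sas/hadoopjars
echo "SAS_HADOOP_JAR_PATH=$SAS_HADOOP_JAR_PATH"
```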

4.4 Base SAS Interface to Hadoop

5 Part 2: The HADOOP FILENAME Statement and PROC HADOOP

5.1 Business Scenario

  • We want to develop a prototype for a process that uses SAS to orchestrate the following scenario:
    1. Move unstructured text files into the Hadoop file system.
    2. Invoke MapReduce programs developed by Java programmers in order to:
      • read and process the text files to perform various analyses
      • output results as text files in the Hadoop file system
    3. Read the summarized text analysis results back into SAS for further analysis and reporting purposes.

5.1.1 Breakdown 1

  1. Move unstructured text files into the Hadoop file system.
  • PROC HADOOP and the HDFS COPYFROMLOCAL statement

5.1.2 Breakdown 2

  1. Invoke MapReduce programs developed by Java programmers in order to:
    • read and process the text files to perform various analyses
    • output results as text files in the Hadoop file system
  • PROC HADOOP and the MAPREDUCE statement

5.1.3 Breakdown 3

  1. Read the summarized text analysis results back into SAS for further analysis and reporting purposes.
  • FILENAME statement for Hadoop and DATA step

5.2 Business Scenario Pseudocode

proc hadoop...;
  hdfs copyfromlocal="local file" out="hdfs file";
run;

proc hadoop...;
  mapreduce input="hdfs file" output="hdfs outfile" ...;
run;

filename fileref hadoop "hdfs outfile" ...;
data somedata;
  infile fileref ...;
  input ...;
run;

5.2.1 Breakdown 1

  1. Move unstructured text files into the Hadoop file system
proc hadoop...;
  hdfs copyfromlocal="local file" out="hdfs file";
run;

5.2.2 Breakdown 2

  1. Input the HDFS file to the MapReduce program and output results to an HDFS file
proc hadoop...;
  mapreduce input="hdfs file" output="hdfs outfile" ...;
run;

5.2.3 Breakdown 3

  1. Read the MapReduce output with SAS for further processing
filename fileref hadoop "hdfs outfile" ...;
data somedata;
  infile fileref ...;
  input ...;
run;

5.3 PROC HADOOP

  • PROC HADOOP submits:
    • Hadoop file system (HDFS) commands
    • MapReduce programs
    • Pig language code
PROC HADOOP <Hadoop-server-option(s)>;
  HDFS <Hadoop-server-option(s)> <hdfs-command-option(s)>;
  MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
  PIG <Hadoop-server-option(s)> <pig-code-option(s)>;
  PROPERTIES <configuration-properties>;
RUN;

5.4 HDFS Statements

HDFS COPYFROMLOCAL="local-file" OUT="output-location" <DELETESOURCE> <OVERWRITE>;

HDFS COPYTOLOCAL="HDFS-file" OUT="output-location" <DELETESOURCE> <OVERWRITE> <KEEPCRC>;

HDFS DELETE="HDFS-file" <NOWARN>;

HDFS MKDIR="HDFS-path";

HDFS RENAME="HDFS-file" OUT="new-name";
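The statements above follow the same pattern. As a sketch (the HDFS paths are hypothetical), several commands can be combined in one PROC HADOOP step:

```sas
filename hadconfg "/work/hadoop_config.xml";
proc hadoop options=hadconfg username="&std";
  /* create a staging directory */
  hdfs mkdir="/user/&std/staging";
  /* rename a file within HDFS */
  hdfs rename="/user/&std/staging/raw.txt" out="/user/&std/staging/input.txt";
  /* remove an obsolete file; NOWARN suppresses the error if it is absent */
  hdfs delete="/user/&std/staging/old.txt" nowarn;
run;
```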

5.5 Moving a File from SAS to Hadoop

  • This program creates a directory in the Hadoop file system (HDFS) and copies a file from the SAS server to the new HDFS directory.
filename hadconfg "/work/hadoop_config.xml";
proc hadoop options=hadconfg username="&std" verbose;
  hdfs mkdir="/user/&std/data";
  hdfs copyfromlocal="/work/DIACCHAD/data/moby_dick_via_sas.txt" out="/user/&std/data";
run;

5.6 Execute MapReduce Code

MAPREDUCE <Hadoop-server-option(s)> <mapreduce-option(s)>;
proc hadoop options=hadconfg username="&std";
  mapreduce
    input="source-file"
    output="target-file"
    jar="jar file containing MapReduce code"
    outputkey="output key class (in MapReduce code)"
    outputvalue="output value class"
    reduce="reducer class"
    combine="combiner class"
    map="map class";
run;

5.6.1 Breakdown 1

input="source-file"
  • The input file in HDFS that the MapReduce program reads
output="target-file"
  • The output file in HDFS that the MapReduce program writes to

5.6.2 Breakdown 2

jar="jar file containing MapReduce code"
  • The JAR file containing the MapReduce program and named classes

5.6.3 Breakdown 3

outputkey="output key class (in MapReduce code)"
  • The name of the output key class in dot notation
outputvalue="output value class"
  • The name of the output value class in dot notation

5.6.4 Breakdown 4

reduce="reducer class"
combine="combiner class"
map="map class";
  • The Java classes in the MapReduce program that execute the map, reduce, and combine steps

5.7 MapReduce Example

  • In the demonstration, PROC HADOOP will be used to invoke a MapReduce program that will do the following:
    • read a text file containing free, unstructured text. This file can be distributed across multiple data nodes in Hadoop
    • parse the text into the individual words in each data node
    • count up the number of instances of each unique word found in each data node
    • combine total counts for each unique word across nodes
    • write the results to an HDFS output file
proc hadoop options=hadconfg username="&std" verbose;
  mapreduce
    jar = "<hdfs path>/hadoop-mr1-cdh.jar"
    input = "<hdfs path>/moby_dick_via_sas.txt"
    map = "org.apache.hadoop.examples.WordCount$TokenizerMapper"
    reduce = "org.apache.hadoop.examples.WordCount$IntSumReducer"
    combine = "org.apache.hadoop.examples.WordCount$IntSumReducer"
    outputkey = "org.apache.hadoop.io.Text"
    outputvalue = "org.apache.hadoop.io.IntWritable"
    output = "<hdfs path>/mapoutput"
  ;
run;

5.7.1 Breakdown 1

  1. Read a text file containing free unstructured text
input = "<hdfs path>/moby_dick_via_sas.txt"

5.7.2 Breakdown 2

  1. In parallel in each data node, parse the text into the individual words
map = "org.apache.hadoop.examples.WordCount$TokenizerMapper"

5.7.3 Breakdown 3

  1. Sum the counts for each unique word across data nodes to produce the final totals
reduce = "org.apache.hadoop.examples.WordCount$IntSumReducer"

5.7.4 Breakdown 4

  1. In each data node, combine the local counts for each unique word before they are sent to the reducer
combine = "org.apache.hadoop.examples.WordCount$IntSumReducer"

5.7.5 Breakdown 5

  1. The output contains each unique word (outputkey) as Text and the number of times it occurred in the input file as an IntWritable integer
outputkey = "org.apache.hadoop.io.Text"
outputvalue = "org.apache.hadoop.io.IntWritable"

5.7.6 Breakdown 6

  1. The HDFS location the output is written to
output = "<hdfs path>/mapoutput"

5.8 The FILENAME Statement for Hadoop

  • In SAS, the FILENAME statement associates a fileref with an external file and an output device or access method.
FILENAME fileref <device type or access method> "external file" <options>;
filename in hadoop "Hadoop-file-path" concat cfg=xml-config-file user="&std";

5.8.1 Breakdown 1

in
  • file reference

5.8.2 Breakdown 2

hadoop
  • Access method

5.8.3 Breakdown 3

"Hadoop-file-path"
  • The directory containing the concatenated files to read

5.8.4 Breakdown 4

concat
  • Specifies to read each file in the directory defined by the Hadoop file path
  • Note: This option is valid only for reading (not writing) Hadoop files with the FILENAME statement.

5.8.5 Breakdown 5

cfg=xml-config-file
  • Points to the location of the Hadoop configuration file on the machine where SAS is executing.

5.8.6 Breakdown 6

user="&std"
  • The user ID to connect to Hadoop

5.9 Reading a Hadoop File with a DATA Step

filename hadconfg "/work/hadoop_config.xml";

filename mapres hadoop "/user/&std/data/mapoutput" concat cfg=hadconfg user="&std";

data work.commonwords;
  infile mapres dlm="09"x;
  input word $ count;
  ...
run;
  • hadconfg <=====> cfg=hadconfg
  • mapres <=====> infile mapres
  • "09"x is the hex code constant for the tab character

5.9.1 Another Example

filename hadconfg "/work/hadoop_config.xml";
filename orders hadoop "/user/shared/data/custorders.txt" cfg=hadconfg user="&std";

data work.custorders;
  infile orders;
  input @1 customer_id 8. ...;
run;
proc print data=work.custorders;
run;
  • Does the DATA step read a single HDFS file or a concatenated directory?

5.10 Base SAS: FILENAME for Hadoop (Review)

5.11 Reading a Hadoop File with a DATA Step

5.11.1 Step 1

5.11.2 Step 2

5.11.3 Step 3

5.12 Writing a Hadoop File with a DATA Step

filename hadconfg "/work/hadoop_config.xml";
filename out hadoop "/user/&std/data/custord" dir cfg=hadconfg user="&std";

data _null_;
  set work.custorders;
  file out(corders) dlm=",";
  put customer_id
      country
      gender
      birth_date
      product_id
      order_date
      quantity
      costprice_per_unit;
run;
  • dir: Specifies writing to files in the directory defined by the Hadoop file path
  • corders: Create a file called corders in the HDFS directory "/user/&std/data/custord"

5.12.1 Step 1

5.12.2 Step 2

5.12.3 Step 3

5.13 Use Ambari to Browse the Hadoop File System

6 Part 3: Executing Pig Code With PROC HADOOP

6.1 Executing Pig Code with PROC HADOOP

PIG CODE=fileref | "external-file"
    PARAMETERS=fileref | "external-file"
    REGISTERJAR="external-file(s)"
;
filename pigcode "/workshop/DIACCHAD/pigcode.txt";

proc hadoop options=hadconfg username="hdfs" verbose;
  pig code=pigcode;
run;
  • pigcode.txt:
A = LOAD '/user/shared/data/custord'
    USING PigStorage (',')
    AS (customer_id, country, gender, birth_date, product_id,
        order_date, quantity, costprice_per_unit);

B = FILTER A BY gender == 'F';

STORE B INTO '/user/shared/data/student1';

6.1.1 Pig Code Breakdown 1

A = LOAD '/user/shared/data/custord'
    USING PigStorage (',')
    AS (customer_id, country, gender, birth_date, product_id,
        order_date, quantity, costprice_per_unit);
  • Load the comma delimited file and name each field

6.1.2 Pig Code Breakdown 2

B = FILTER A BY gender == 'F';
  • Subset the loaded file for records where gender = 'F'

6.1.3 Pig Code Breakdown 3

STORE B INTO '/user/shared/data/student1';
  • Store the results in the Hadoop file system directory indicated

7 Using the SQL Pass-Through Facility

7.1 SAS/ACCESS Interface to Hadoop

7.2 SQL Pass-Through Query Example

proc sql;
  connect to hadoop (server=namenode port=10000 subprotocol=hive2 schema=diacchad user="&std");
    select * from connection to hadoop
      (select employee_name, salary
        from salesstaff
        where emp_hire_date between '2011-01-01' and '2011-12-31'
      );
  disconnect from hadoop;
quit;
  • Note: The query in parentheses is passed directly to Hive and executed by Hive as a HiveQL query.

7.3 Joining Multiple Tables from Different Databases

proc sql;
  connect to hadoop (server=namenode subprotocol=hive2 port=10000 schema=diacchad user="&std");
    select * from connection to hadoop
      (select st.employee_id,
              employee_name,
              st.job_title,
              emp_hire_date,
              st.salary,
              concat(ltrim(last_name), ', ', ltrim(first_name)) as mgrname
        from salesstaff st join diacch2.sales s
          on st.manager_id=s.employee_id
        where emp_hire_date between '2011-01-01' and '2011-12-31'
      );
  disconnect from hadoop;
quit;

7.4 Creating a SAS File from Hive Results

  • Partial PROC SQL code:
proc sql;
  connect to hadoop (server=namenode subprotocol=hive2 ...);
  create table manager_rep_list as
  select
    MgrName as Manager_Name
      label='Manager Name',
    Employee_ID
      label='ID Number',
    Employee_Name
      label='Name',
    Job_Title
      label='Job Title',
    input(Emp_Hire_Date, yymmdd10.) as Emp_Hire_Date
      format=mmddyyp10.
      label='Hired Date',
    Salary
      format=dollar12.
  from connection to hadoop
    (select ... );

7.5 Using HiveQL DDL Statements in SAS

  • With SQL pass-through EXECUTE statements, Hive tables can be defined using HiveQL Data Definition Language (DDL).
proc sql;
  connect to hadoop (connection options);
    execute (create table customer
             (customer_id int,
               country string,
               gender string,
               birthdate string)
             row format delimited
             fields terminated by '\001'
             stored as textfile
             location "/user/&std/data/customer")
    by hadoop;
  disconnect from hadoop;
quit;

8 Using the SAS/ACCESS LIBNAME Engine

9 The LIBNAME Statement (Review)

LIBNAME libref 'SAS-data-library' <options>;

10 The SAS/ACCESS LIBNAME Statement

LIBNAME libref engine-name <connection-options> <LIBNAME-options>;
libname hivedb hadoop server=namenode
        subprotocol=hive2
        port=10000 schema=diacchad
        user="&std" pw="&stdpw";
Partial SAS log:
23    libname hivedb hadoop server=namenode
24        subprotocol=hive2
25        port=10000 schema=diacchad
26        user="&std" pw="&stdpw";
NOTE: Libref HIVEDB was successfully assigned as follows:
  Engine: HADOOP
  Physical Name: jdbc:hive2://namenode:10000/diacchad

11 LIBNAME Statement Connection Options

Option Specifies
USER= Hive user name
PW= Hive password associated with the Hive user
SERVER= Hadoop server machine name or IP address
SUBPROTOCOL= The version of Hive that is running
SCHEMA= The Hive schema to access
PORT= The port used to connect to the server

12 Listing of Hive Tables in the Schema

proc contents data=hivedb._all_ nods;
run;

13 Contents of a Hive Table

proc contents data=hivedb.customerorders;
run;

14 Listing of a Hive Table

proc print data=hivedb.customerorders;
  var customer_id
      order_type
      order_date
      product_id
      quantity;
  where order_type=3 and
      order_date between '01Jan2011'd and '15Jan2011'd;
run;

15 SAS/ACCESS Engine Implicit Pass-through

16 Implicit versus Explicit SQL Pass-through

17 Optimizing Implicit Pass-through

18 The SASTRACE= SAS System Option

SASTRACE="Value" Description
",,,d" specifies that all SQL statements sent to the DBMS are sent to the log.
",,,s" specifies that a summary of timing information for calls made to the DBMS is sent to the log.
OPTIONS SASTRACE="value" <options>;
options sastrace=",,,d" nostsuffix sastraceloc=saslog;
proc print data=hivedb.customerorders;
run;
options sastrace=off;

19 NOSTSUFFIX SAS System Option

OPTIONS NOSTSUFFIX;
options sastrace=",,,d" nostsuffix sastraceloc=saslog;
proc print data=hivedb.customerorders;
run;
options sastrace=off;

20 The SASTRACELOC= SAS System Option

OPTIONS SASTRACELOC=STDOUT|SASLOG|FILE "filename";
options sastrace=",,,d" nostsuffix sastraceloc=saslog;
proc print data=hivedb.customerorders;
run;
options sastrace=off;

21 SASTRACE= Messages

22 Using the MEANS and FREQ Procedures

options sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc means data=hivedb.order_fact sum mean;
  var total_retail_price;
run;

proc freq data=hivedb.order_fact;
  tables order_type;
run;

options sastrace=off;
NOTE: SQL generation will be used to perform the initial summarization.

HADOOP_41: Executed: on connection 7
  select T1.ZSQL1, T1.ZSQL2, T1.ZSQL3, T1.ZSQL4 from
        ( select COUNT(*) as ZSQL1, COUNT(*) as ZSQL2,
          COUNT(TXT_1.`total_retail_price`) as ZSQL3,
          SUM(TXT_1.`total_retail_price`) as ZSQL4
          from `ORDER_FACT` TXT_1 ) T1
                  where T1.ZSQL1 > 0

ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.
NOTE: SQL generation will be used to construct frequency and crosstabulation tables.

HADOOP_53: Executed: on connection 7
  select COUNT(*) as ZSQL1, case when COUNT(*) > COUNT(TXT_1.`order_type`) then NULL else
            MIN(TXT_1.`order_type`) end as ZSQL2,
            MAX(TXT_1.`order_type`) as ZSQL3 from `ORDER_FACT` TXT_1
            group by TXT_1.`order_type`

ACCESS ENGINE: SQL statement was passed to the DBMS for fetching data.

23 Supported SAS Language Functions

24 Using a Supported SAS Function

options sastrace=",,,d" sastraceloc=saslog nostsuffix;

/* Orders Placed in 2011 */
data orders2011;
  set hivedb.customerorders;
  where year(order_date)=2011;
run;

options sastrace=off;

25 Using a Non-Supported Function

options sastrace=",,,d" sastraceloc=saslog nostsuffix;

/* Orders with Saturday Delivery */
data sat_deliveries;
  set hivedb.customerorders ;
  where weekday(delivery_date)=7;
run;

options sastrace=off;

26 Using SAS Data Set Options

Option Function
KEEP=column1 column2 Lists the column names to include in processing or writing to output tables.
DROP=column1 column2 Lists the column names to exclude from processing or writing to output tables.
OBS=n Specifies the number of the last observation to process.
FIRSTOBS=n Specifies the first observation to begin reading. By default, FIRSTOBS=1.
RENAME=(oldname=newname) Enables you to change the name of columns in output data sets.
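As a sketch of how these options combine (assuming the hivedb libref assigned earlier; the column choices are illustrative):

```sas
/* Read only rows 6-10 of the Hive table, keep three columns,
   and rename quantity on the way in. */
proc print data=hivedb.customerorders
           (keep=customer_id order_date quantity
            firstobs=6 obs=10
            rename=(quantity=qty));
run;
```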

27 Selected SAS Data Set Options

data salesrep_i;
  set hivedb.salesstaff
    (drop=ssn birth_date
      rename=(job_title=jobcode));
  where jobcode="Sales Rep. I";
run;

28 SASDATEFMT= SAS/ACCESS Data Set Option

proc print data=hivedb.salesstaff
           (sasdatefmt=(birth_date="mmddyyp10."
                        emp_hire_date="date11."));
  var employee_id
      employee_name
      birth_date
      emp_hire_date;
  where birth_date < "01jan1955"d;
run;

29 Creating a SAS PROC SQL View

PROC SQL;
  CREATE VIEW view-name AS
    SELECT column-1, column-2, ...column-n
      FROM table-1<,table-n>
      ...;
QUIT;
libname hivedb hadoop
        server=namenode
        subprotocol=hive2
        port=10000
        schema=diacchad
        user="&std"
        pw="&stdpw";
proc sql;
  create view mysasdat.US_F_customers as
    select customer_id,
           customer_name,
           birth_date format=year4.,
           customer_type_id label='Customer Type'
      from hivedb.customer
      where country='US' and gender='F';
quit;
NOTE: SQL view mysasdat.US_F_CUSTOMERS has been defined.

30 Using a SAS PROC SQL View

proc freq data=mysasdat.US_F_Customers;
  tables customer_type_id birth_date/nocum;
run;

31 Ways to Combine Data

32 Passing Joins to Hive

33 Joining Tables from a Single Connection

libname hivedb hadoop
        subprotocol=hive2
        server=namenode
        port=10000
        schema=diacchad
        user="&std"
        pw="&stdpw";
proc sql;
  create table manager_rep_list as
    select st.employee_iD,
           employee_name,
           st.job_title,
           emp_hire_date,
           st.salary,
           trim(last_name) || ', ' || trim(first_name) as mgrname
      from hivedb.salesstaff st, hivedb.sales s
      where manager_id=s.employee_id and
            emp_hire_date between '01JAN2011'd and '31DEC2011'd;
quit;

34 Joining Tables from Multiple Connections

libname hivedb hadoop
        subprotocol=hive2
        server=namenode
        port=10000
        schema=diacchad
        user="&std"
        pw="&stdpw";
libname hived2 hadoop
        subprotocol=hive2
        server=namenode
        port=10000
        schema=diacch2
        user="&std"
        pw="&stdpw";
proc sql;
  create table manager_rep_list as
    select st.employee_iD,
           employee_name,
           st.job_title,
           emp_hire_date,
           st.salary,
           trim(last_name) || ', ' || trim(first_name) as mgrname
      from hivedb.salesstaff st,
           hived2.sales s
      where manager_id=s.employee_id and
            emp_hire_date between '01JAN2011'd and '31DEC2011'd;
quit;

35 SQL Set Operators

36 Stacking Tables Using Set Operators

proc sql number;
  create table type2 as
    select order_id,
           customer_id,
           order_type,
           order_date,
           delivery_date
      from hivedb.qtr1
      where order_type=2
    union
    select order_id,
           customer_id,
           order_type,
           order_date,
           delivery_date
      from hivedb.qtr2
      where order_type=2;
  select * from type2;
quit;

37 Combining Hive Tables with Other Sources

38 Combining SAS Data Set and Hive Table

libname sasdata "/workshop/DIACCHAD";
libname hivedb hadoop
        server=namenode
        subprotocol=hive2
        port=10000
        schema=diacchad
        user="&std"
        pw="&stdpw";
data phonelist;
  merge sasdata.employee_phones(in=p)
        hivedb.salesstaff(in=s);
  by employee_id;
  if p=1 and s=1;
  keep employee_id
       employee_name
       phone_type
       phone_number;
run;

39 Copying Data Sets to Hive

libname sasdat "/workshop/DIACCHAD";
libname myhive hadoop
        server=namenode
        subprotocol=hive2
        port=10000
        schema=&std
        user="&std"
        pw="&stdpw";
proc copy in=sasdat out=myhive;
  select customer_dim;
run;